About This Manual


Objectives of This Manual 

The CMMD Reference Manual describes the CMMD library, a library of
communication routines used for creating message-passing programs (sometimes
called MIMD programs) to run on the Connection Machine CM-5 supercomputer.
It provides


■ a brief introduction to the library and to the host/node message-passing
model that it implements.

■ a “quick reference” list of routines provided by the library, organized by 
which processors (host, node, or both) and how many processors (one, two 
or more, or all) can or must call the routine. 

■ reference chapters for each functional group of routines. These chapters
provide information on the routines themselves and, in some cases, on the
way in which the routines function and the uses to which they may be put.


Intended Audience 

This manual is written for programmers who are developing or porting
message-passing programs to run on the Connection Machine CM-5 supercomputer. It
assumes some previous knowledge of message-passing programming.


Related Documents 

CMMD User’s Guide: The CMMD Reference Manual should be used in
conjunction with the CMMD User’s Guide, which provides an introduction to the
CM-5 supercomputer itself and to the manner in which message-passing
programs execute on that machine. Programmers new to the CM-5 supercomputer
are urged to read the first two chapters of the user’s guide before beginning
programming on the machine.

Later chapters of the user’s guide describe the tools for compilation, linking,
debugging, and program analysis.

Manual Pages: The reference descriptions for individual routines provided in
this manual are also available on-line as manual pages accessible via the man
command.


Revision Information 

This edition of the CMMD Reference Manual documents Version 1.1 of the 
CMMD library. Readers should note that this library is still under development 
and is therefore subject to change. 


Notation Conventions 

The table below displays the notation conventions observed in this manual. 


Convention            Meaning

bold typewriter       CMMD functions, and UNIX and CM System Software
                      commands, command options, and filenames, when they
                      appear in syntax statements or embedded in text.

italics               Argument names and placeholders in function and
                      command formats.

typewriter            Code examples and code fragments.

% bold typewriter     In interactive examples, user input is shown in
typewriter            bold typewriter and system output is shown in
                      regular typewriter font.
Customer Support


Thinking Machines Customer Support encourages customers to report errors in
Connection Machine operation and to suggest improvements in our products.

When reporting an error, please provide as much information as possible to help us
identify and correct the problem. A code example that failed to execute, a session transcript,
the record of a backtrace, or other such information can greatly reduce the time it takes
Thinking Machines to respond to the report.

If your site has an Applications Engineer or a local site coordinator, please contact that 
person directly for support. Otherwise, please contact Thinking Machines’ home office 
customer support staff: 

U.S. Mail:        Thinking Machines Corporation
                  Customer Support
                  245 First Street
                  Cambridge, Massachusetts 02142-1264

Internet
Electronic Mail:  customer-support@think.com

UUCP
Electronic Mail:  ames!think!customer-support

Telephone:        (617) 234-4000
                  (617) 876-1111





Chapter 1 

Introduction 


1.1 Introducing CMMD 

The CM message-passing library, CMMD, provides facilities for cooperative
message passing between processing nodes. It thus provides simple
interprocessor communication that falls outside the range of the CM data parallel
languages.

This library is expected to be of particular interest to users who have written C 
or Fortran programs for machines with MIMD architectures. Such users can port 
their programs to the CM-5 by replacing the original message-passing library 
calls with calls to CMMD routines. 


The Cooperative Message-Passing Model 

CMMD supports a programming model frequently referred to as host/node
programming. This model involves two simultaneously running programs. One
program runs on the host, while independent copies of the node program run on
each processing node. On the CM-5, the host is the partition manager (PM) that
controls a partition of the system, while the nodes are the processing nodes within
the partition. The host begins execution by performing needed initializations
(including initializing the CMMD library) and then invoking the node program; it
may have little involvement in subsequent computations.

Within this general programming model, CMMD permits cooperative concurrent 
processing, in which synchronization occurs only between matched sending and 
receiving nodes and only during the act of communication. At all other times, 
computing on each node proceeds asynchronously. 
This initial release of CMMD primarily supports blocking message sending and
receiving, but does provide limited support for non-blocking message passing as 
well. (Future versions of the library are expected to offer further support for 
asynchronous message passing.) Blocking routines are synchronized routines in 
which senders wait for their recipients to respond before continuing execution, 
and vice versa. Programmers using such routines must ensure that each sending 
routine is matched with a receiving routine, or deadlock may ensue. (The CM-5 
timesharing operating system ensures that any such deadlock affects only the 
erring program, and has no effect on other programs sharing the partition.) 

In addition, global functions provide for broadcasting data from and reducing it
to the host, for scan and reduce operations, and for global synchronization. (Like
their data parallel counterparts, CMMD global functions are able to take
advantage of the CM-5’s hardware support for global communications.)


Two Exceptions 

Two exceptions to the cooperative message-passing format exist. The first is a 
facility for sending non-blocking short messages. Using this facility, each node 
can send one short message to one or more other nodes and then continue its 
program without waiting for a response. Only one message from one given node 
to another can be outstanding; sending two or more messages to the same node 
requires some synchronization. 

For example, node 1 can send node 3 a short message, then perform computations
without waiting for node 3 to receive the message. If node 1 then sends a
second message to node 3, the system software will check the status of the first 
message. If that first message has been received, the second is also sent as a 
non-blocking message. If, however, the first message has not yet been received 
by node 3, the second send will block until receipt of the first message. 
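
The rule in this example can be modeled in a few lines of plain C. The sketch below is only a model of the behavior described above, not CMMD code: the type short_msg_state, the bound MAX_NODES, and the three functions are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 1024  /* assumption: illustrative upper bound only */

/* Hypothetical model of one sender's view: outstanding[d] is true while a
   short message to node d has been sent but not yet received. */
typedef struct {
    bool outstanding[MAX_NODES];
} short_msg_state;

/* Returns true if a new short send to `dest` would have to block, that is,
   if an earlier short message to the same node is still outstanding. */
bool short_send_would_block(const short_msg_state *s, int dest)
{
    assert(dest >= 0 && dest < MAX_NODES);
    return s->outstanding[dest];
}

/* Records a short send to `dest`. The caller must first make sure that
   the send would not block. */
void record_short_send(short_msg_state *s, int dest)
{
    assert(!short_send_would_block(s, dest));
    s->outstanding[dest] = true;
}

/* Records that `dest` has received the outstanding message. */
void record_receipt(short_msg_state *s, int dest)
{
    assert(dest >= 0 && dest < MAX_NODES);
    s->outstanding[dest] = false;
}
```

In this model, a second send from node 1 to node 3 is non-blocking only after record_receipt has been called for node 3; sends to other nodes are unaffected.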

The second exception is a pair of routines that operate outside the CMMD
message-passing protocol and thus allow programmers to define their own protocols.
These routines should be used only by programmers who are highly experienced 
in writing message-passing programs, as they provide almost no safeguards 
against disaster. 
CMMD and Other CM Software

CMMD can be called from C and from Fortran 77. This manual documents the 
C interface (that is, it uses C syntax and data types). Section 1.3, at the end of this 
chapter, shows the relationship between C data types and Fortran data types. 

CMMD routines are completely compatible with the current release of the CM-5
operating system, CMOST Version 7.1. Programs under the control of CMMD
routines, however, cannot make calls to data parallel CM libraries, such as
parallel I/O or graphics routines. Standard (serial) C calls can be used: UNIX I/O calls
from the host program, for instance, or Xlib graphic routines. Future versions of
the CMMD library are expected to make provision for moving data between
CMMD and data parallel programming modes.

Please note that this library is under continual development and hence subject to 
possibly substantial changes. 

This manual provides information on the CMMD routines. See the CMMD User’s 
Guide for information on compiling, loading, use of timers, and debugging. 


1.2 How Many Nodes? 

Synchronization of processors under the message-passing model affects different 
numbers of processors according to the operation being performed. 

■ When one node sends a message, and a second receives it, those two nodes
must synchronize. Until both have made their respective calls and the
message is transferred, neither call can return.

■ If more than two nodes are involved in a set of messages (which can
happen in send_and_receive calls), all those nodes must complete their calls
before any of the calls can return.

■ When a global function is invoked, no call can return until every node (and 
sometimes the host) has made the call. 

■ Informational functions usually involve only one node; for example, any 
node may check whether it has a message pending without involving any 
other node. 
Programs using CMMD calls have the responsibility of checking that all requisite
nodes make the appropriate calls at the appropriate times. If this is not done,
program performance will suffer and deadlock may ensue.

Please note that global routines can be used only when all processors in the
partition take part. If some section of a program involves only a single subset of
processors, it cannot make a global call on that subset without hanging the entire
program.

The chart on the next several pages summarizes CMMD routines by functionality
and by the number and identity of nodes that must call them. Once you are
acquainted with the library, you can use this chart as a quick reference.

Succeeding chapters discuss each functional group of routines and provide
reference writeups for each routine.
CMMD Function Summary


Single-Node Functions: Host Only


Enabling and Disabling Library Use

CMMD_enable()
CMMD_is_enabled()
CMMD_disable()
CMMD_suspend()
CMMD_is_suspended()
CMMD_resume()


Global Synchronization 

CMMD_barrier_sync() 


Version 1.1, January 1992 





CMMD Reference Manual 


Single-Node Functions: Host or Any Node 


Informational Functions 

CMMD_self_address()
CMMD_host_node()
CMMD_partition_size()

CMMD_bytes_received()
CMMD_bytes_sent()
CMMD_msg_sender()
CMMD_msg_tag()


Polling

CMMD_msg_pending (int node, int tag)


Setting and Getting Global Or

CMMD_set_global_or (int value)
CMMD_get_global_or()


Sending Short Messages

CMMD_send_short (int destination, int tag, void *buffer, int len)
CMMD_wait_for_send (int destination)
Two-Node Functions

(Note: In any of these functions, a single node may play both roles, being both
sender and receiver.)


Sending and Receiving Messages

CMMD_send (int destination, int tag, void *buffer, int len)

CMMD_send_v (int destination, int tag, void *buffer, int elem_len,
             int stride, int elem_cnt)

CMMD_receive (int source, int tag, void *buffer, int len)

CMMD_receive_v (int source, int tag, void *buffer, int elem_len,
                int stride, int elem_cnt)

CMMD_send_and_receive (int source, int source_tag, void *inbuffer,
                       int inlen, int destination, int dest_tag,
                       void *outbuffer, int outlen)

CMMD_send_and_receive_v (int source, int source_tag, void *inbuffer,
                         int in_elem_len, int in_stride, int in_elem_cnt,
                         int destination, int dest_tag, void *outbuffer,
                         int out_elem_len, int out_stride, int out_elem_cnt)

CMMD_swap (int processor, void *inbuffer, int inlen, void *outbuffer,
           int outlen)

CMMD_swap_v (int processor, void *inbuffer, int in_elem_len,
             int in_stride, int in_elem_cnt, void *outbuffer,
             int out_elem_len, int out_stride, int out_elem_count)
Global Functions: All Nodes, but Not Host


Global Synchronization

CMMD_sync_with_nodes()


Reduce, Scan, and Concatenate

CMMD_reduce_<type> (<type> value, CMMD_combiner_t combiner)

CMMD_scan_<type> (<type> value, CMMD_combiner_t combiner,
                  CMMD_scan_direction_t direction,
                  CMMD_segment_mode_t smode, int sbit,
                  CMMD_scan_inclusion_t inclusion)

CMMD_concat_with_nodes (void *element, void *buffer,
                        int elem_length)
Global Functions: Host plus All Nodes


Enabling and Disabling Short Message Sending

CMMD_enable_short_messages()
CMMD_disable_short_messages()


Broadcast

CMMD_bc_from_host (void *buffer, int len)
CMMD_receive_bc_from_host (void *buffer, int len)

CMMD_distrib_to_nodes (void *buffer, int elem_length)
CMMD_receive_element_from_host (void *buffer, int length)


Global Synchronization

CMMD_sync_host_with_nodes()
CMMD_sync_with_host()


Reduce and Concatenate

CMMD_reduce_from_nodes_<type> (<type> value,
                               CMMD_combiner_t combiner)

CMMD_reduce_to_host_<type> (<type> value,
                            CMMD_combiner_t combiner)

CMMD_gather_from_nodes (void *buffer, int elem_length)
CMMD_concat_elements_to_host (void *element, int elem_length)
1.3 C and Fortran 77

Fortran 77 calling sequences for CMMD routines are identical to C calling 
sequences, in terms of routine names, parameter names, and parameter order. 

Data types, however, are declared differently. The following table shows
translations from C data types to Fortran 77 data types.


C                        Fortran

int                      integer
char                     character
CMMD_combiner_t          integer
CMMD_scan_direction_t    integer
CMMD_segment_mode_t      integer
CMMD_scan_inclusion_t    integer
unsigned                 integer
float                    real
double                   double precision


In the ANSI C programming language, void is a special data type that has no
meaningful values. The equivalent of a Fortran subroutine (a subprogram that
returns no value) is expressed in C as a function whose return type is void.

A widespread C programming convention is that the type “pointer to void”
represents a pointer to any desired type. If a subroutine has a formal parameter of type
“pointer to void”, then a pointer of any type may correctly be used as the
corresponding actual argument. The called routine must then assume or deduce the
properties of the data pointed to, usually from information conveyed by the other
parameters.

The CMMD library uses this convention for all cases in which an argument is a 
pointer to an area of memory that either contains data to be sent or is reserved for 
data to be received. Pointers indicate only the starts of memory areas; the sizes 
of the areas are specified through other parameters. 
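
The convention can be illustrated with a small stand-alone routine. This is a sketch only: byte_sum is a hypothetical function, not part of CMMD, written in the same style as the library's buffer arguments (a "pointer to void" for the start of the area, plus a separate length in bytes).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical routine following the convention: `buffer` marks only the
   start of the memory area, and `len` gives its size in bytes. */
unsigned long byte_sum(const void *buffer, int len)
{
    const unsigned char *p = (const unsigned char *)buffer;
    unsigned long sum = 0;

    assert(len >= 0);
    assert(buffer != NULL || len == 0);
    for (int i = 0; i < len; i++)
        sum += p[i];   /* the routine treats the area as raw bytes */
    return sum;
}
```

A pointer of any type can be passed without a cast, for example byte_sum(values, (int)sizeof values) for an array of int named values. The routine learns the size of the area only from the len argument.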
2.1 Initializing CMMD

Data parallel and message-passing program execution make different demands
on the CM-5’s communications networks (the Control Network and the Data
Network), and thus require different settings for network participation. For
message-passing programs using CMMD, these settings are controlled by two
pairs of functions, which must be called from the host. The first pair,
CMMD_enable and CMMD_disable, perform the initial tasks necessary first to enable
message passing and later to disable message passing and restore the network
setting to the state it was in when CMMD_enable or CMMD_resume was last
called. The second pair, CMMD_suspend and CMMD_resume, are used to suspend
and resume message passing temporarily within the course of a program (for
example, to allow use of some other library).

Programs or routines using CMMD should therefore begin with the host calling
CMMD_enable and end with the host calling CMMD_disable. Calls by the host
to CMMD_suspend and CMMD_resume may be placed where necessary within the
program (if they are needed).

Each of these calls requires that the system be in the appropriate state: for
instance, an error results from trying to disable message passing when it is not
enabled. Therefore, two informational routines are provided:
CMMD_is_enabled tells whether message passing has been enabled;
CMMD_is_suspended tells whether message passing is currently suspended.
2.2 Initializing the Short Message Facility

At this initial release, CMMD uses a model of cooperative, or loosely
synchronous, message passing. A short message facility within CMMD does, however,
allow the non-blocking sending and receiving of short messages (up to 16 bytes).
This facility must be enabled and disabled separately from CMMD itself. A
program enables CMMD and starts passing messages. At some point, when the
sending of short messages is useful, the program enables that facility, creating
short-message buffers on all the nodes. When the facility is no longer useful, it
may be disabled and its buffer space reclaimed. If the facility is still enabled
when CMMD itself is disabled, it will be disabled automatically as part of the
overall disabling.

The routines that enable and disable the sending of short messages are
CMMD_enable_short_messages and CMMD_disable_short_messages. These
routines must be called synchronously by all nodes and the host; they are
discussed at the end of this chapter.


2.3 Functions That Initialize CMMD 

CMMD_enable()

CMMD_enable must be called by the host at the beginning of any program that
uses CMMD routines. It records the current states of communications in the
networks, allocates space for message buffers in the host and the nodes,
initializes variables needed for message-passing operations, and synchronizes
the host and nodes.


CMMD_is_enabled() 

CMMD_is_enabled returns TRUE if CMMD is currently enabled (that is, if it has 
been enabled and is not suspended). Otherwise, it returns FALSE. Only the host 
can call this function. 
CMMD_disable()

CMMD_disable must be called by the host at the termination of a program that
uses CMMD routines. It synchronizes the host with the nodes, deallocates the
space originally allocated in the host and the nodes for message buffers, and
restores the original states of the communications networks. (That is, it returns the
networks to the state found when CMMD_enable or CMMD_resume was last
called.)

An error is signaled if CMMD is not currently enabled. If it has been suspended, 
it must be resumed before it can be disabled. 


CMMD_suspend() 

CMMD_suspend returns control temporarily to the host processor, to allow data
parallel processing. The routine synchronizes the host with the nodes, saves the
current states of the communication networks, and restores the states that the
networks were in before the latest CMMD_enable or CMMD_resume was called. This
routine can be called only by the host.

CMMD_suspend signals an error if CMMD has not been enabled or if it is already 
suspended. 


CMMD_is_suspended()

CMMD_is_suspended returns TRUE if message passing has been enabled and
then suspended; otherwise, it returns FALSE. Only the host can call this routine.


CMMD_resume() 

If CMMD has been suspended, CMMD_resume saves the current states of the
communications networks and restores the communications network states in effect
before the last call to CMMD_suspend. The user program should ensure that host
and nodes are synchronized after making this call before beginning message
passing again.

CMMD_resume can be called only by the host. It returns an error if CMMD is not
in a suspended state.
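
The state requirements of the enable, disable, suspend, and resume calls can be summarized as a small state machine. The sketch below is a plain C model of the rules stated in this chapter, not CMMD code. The names mp_state and model_* are hypothetical, and two details are assumptions: errors are modeled as a return value of -1, and enabling a library that is already enabled or suspended is treated as an error (the text does not say so explicitly).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the three library states described above. */
typedef enum { MP_DISABLED, MP_ENABLED, MP_SUSPENDED } mp_state;

/* Moves *s from `from` to `to`. Returns 0 on success, or -1 (leaving *s
   unchanged) if the library is not in the required starting state. */
static int mp_transition(mp_state *s, mp_state from, mp_state to)
{
    assert(s != NULL);
    if (*s != from)
        return -1;
    *s = to;
    return 0;
}

/* Assumption: enabling is only valid from the disabled state. */
int model_enable(mp_state *s)  { return mp_transition(s, MP_DISABLED,  MP_ENABLED); }

/* Disabling requires the enabled state; a suspended library must be
   resumed first. */
int model_disable(mp_state *s) { return mp_transition(s, MP_ENABLED,   MP_DISABLED); }

/* Suspending is an error if not enabled or if already suspended. */
int model_suspend(mp_state *s) { return mp_transition(s, MP_ENABLED,   MP_SUSPENDED); }

/* Resuming is an error unless the library is suspended. */
int model_resume(mp_state *s)  { return mp_transition(s, MP_SUSPENDED, MP_ENABLED); }

/* Enabled means "has been enabled and is not suspended". */
bool model_is_enabled(mp_state s)   { return s == MP_ENABLED; }
bool model_is_suspended(mp_state s) { return s == MP_SUSPENDED; }
```

A host program that checks the state first (with CMMD_is_enabled or CMMD_is_suspended in the real library) can avoid every error case that this model returns -1 for.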


2.4 Functions That Initialize the Short Message Facility

CMMD_enable_short_messages()

CMMD_enable_short_messages synchronizes the host and all nodes and
allocates internal storage necessary to support the non-blocking sending of short
messages via the CMMD_send_short function. It must be called on the host and
all nodes.

CMMD_enable_short_messages has no effect on a program’s ability to use
CMMD calls other than CMMD_send_short. All standard CMMD calls can be
used while the send_short facility is enabled.

An error is signaled if the facility is already enabled. 


CMMD_disable_short_messages()

CMMD_disable_short_messages disables the non-blocking sending and
receiving of short messages. It must be called on the host and all nodes.

On each node, the call waits until all short messages sent from this node have
been received (e.g., by CMMD_receive). It then frees the internal storage
allocated on that node for short message support.

If short message passing is enabled at the time that CMMD itself is disabled, then
this function is called internally by CMMD_disable.

An error is signaled if this function is called when the facility is not enabled. 
Processor Information


Processors, both host and nodes, must address each other explicitly during
message passing. Therefore, routines are needed to provide host and node identifiers.
CMMD_host_node provides the host identifier, while CMMD_self_address
provides the calling node’s own identifier.

For each partition, the set of node identifiers consists of the integers from 0 to the
number of nodes in the partition minus 1, inclusive. The function
CMMD_partition_size returns the size of the current partition. The host identifier is an
integer outside the range of the partition size.


3.1 Processor Information Functions 

CMMD_self_address()

Called from a process running on a given node, CMMD_self_address returns
the node identifier for that node.

Node identifiers are integers, from 0 to the number of processors in the
partition minus 1, inclusive. For example, every 128-node partition contains nodes
0 to 127. Node identifiers are logical identifiers: programs and programmers
need never concern themselves with physical processor addresses.
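
The range rule for node identifiers can be written as a one-line check. The helper below is a hypothetical illustration in plain C, not a CMMD routine. In a real program its arguments would typically come from CMMD_self_address and CMMD_partition_size.

```c
#include <assert.h>
#include <stdbool.h>

/* True if `id` is a valid node identifier in a partition of
   `partition_size` nodes, that is, 0 <= id <= partition_size - 1. */
bool is_node_id(int id, int partition_size)
{
    assert(partition_size > 0);
    return id >= 0 && id < partition_size;
}
```

For a 128-node partition, identifiers 0 through 127 pass this check. The host identifier, being outside the range of the partition, does not.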
CMMD_host_node()

CMMD_host_node returns the host identifier (an integer not in the partition set).
It can be called from the host itself or from any node.


CMMD_partition_size()

CMMD_partition_size returns the number of processors in the current
partition. It can be called from the host or from any node.
Message Passing


4.1 Introduction 

Blocking and Non-Blocking Message Passing 

This initial version of CMMD primarily supports cooperative message passing, 
in which the sending and receiving of messages are synchronized. Most of the 
message-passing routines discussed in this chapter fit this model. They not only 
pass information from one node to another, but also synchronize the nodes in so 
doing. They are therefore called blocking routines. 

CMMD does, however, allow the non-blocking sending and receiving of short
messages (up to 16 bytes). This facility must be enabled and disabled separately
from CMMD itself, using the routines CMMD_enable_short_messages and
CMMD_disable_short_messages. These routines, which must be called by
host and all nodes, are discussed in Chapter 2.

Two routines are used to send short messages: CMMD_send_short to actually
send the message, and CMMD_wait_for_send to allow users to impose some
measure of synchronization, should they wish to do so. These routines are discussed
in Section 4.4, at the end of this chapter. No special routines are needed for
receiving short messages: CMMD_receive or CMMD_receive_v may be used.
Patterns of Message Passing

A processor can play one of four roles in message passing:

■ It can send a message.

■ It can receive a message.

■ It can send and receive messages simultaneously. Two special cases:

    ■ It can take part in a cshift, in which all nodes simultaneously send
    (in one direction) and receive (from another direction).

    ■ It can take part in a swap, in which it and one other processor
    exchange messages, simultaneously sending to and receiving from
    each other.

Routines are provided for each of these roles: send, receive, send and receive, 
and swap. These routines are discussed in Sections 4.2 and 4.3. 
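
For the cshift pattern, each node needs to know which node it sends to and which node it receives from. The sketch below computes these partners, assuming that cshift means a circular shift by a fixed offset over node identifiers 0 to n - 1 (the text above does not spell this out). The function names are hypothetical and are not CMMD routines.

```c
#include <assert.h>

/* Node that `self` sends to in a circular shift by `offset` positions
   among `n` nodes numbered 0..n-1. The double modulo keeps the result
   non-negative when `offset` is negative. */
int cshift_destination(int self, int offset, int n)
{
    assert(n > 0 && self >= 0 && self < n);
    return ((self + offset) % n + n) % n;
}

/* Node that `self` receives from in the same circular shift. */
int cshift_source(int self, int offset, int n)
{
    assert(n > 0 && self >= 0 && self < n);
    return ((self - offset) % n + n) % n;
}
```

The two results could serve as the destination and source arguments of a send-and-receive call. A swap is the case in which both partners are the same node.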


Regular Messages and Vector Messages 

Message-passing routines support two types of messages: standard messages, in 
which bytes are stored in normal sequential order, and vector messages, in which 
elements are separated by some amount of space. Each of the routines in this 
section, therefore, has two versions: a standard version, and a vector version 
(labeled with a final _v). 

In a vector message, the distance between the starting position of one element 
and the starting position of the next element is referred to as the “stride.” The 
stride includes one element plus the intervening space before the beginning of the 
next element. Normally, therefore, the stride is larger than the element size. 


[Figure: two elements in a buffer; each stride runs from the start of one
element to the start of the next.]
4.2 Functions for the Paired Sending and
Receiving of Messages

4.2.1 Sending Messages

CMMD_send (int destination, int tag, void *buffer, int len)

CMMD_send_v (int destination, int tag, void *buffer, int elem_len,
             int stride, int elem_cnt)

destination   An integer identifying the node to which the message is
              to be sent.

tag           An integer from 0 to 127, inclusive, which serves as a
              label for the message.

*buffer       A pointer to a buffer that contains the message to be sent.

len           The length of the buffer, in bytes.

elem_len      (Vector sends only.) An integer specifying the length of
              each element in the vector.

stride        (Vector sends only.) An integer specifying the distance in
              bytes between the starting addresses of vector elements.

elem_cnt      (Vector sends only.) An integer specifying the number of
              elements in the vector.

CMMD_send and CMMD_send_v send the contents of a buffer of specified length,
tagged with the specified tag, to the given destination node. The node must be
inside the partition; otherwise, an error results. (The symbol DEFAULT_MSG_TAG
is the standard default tag.)

Buffers may be of any length up to the maximum memory per node. A NULL 
buffer pointer or a length of zero causes a message of zero data length to be sent. 

The message is not sent until the receiving node acknowledges that it is ready to 
receive a message labeled with the specified tag from this node. In its response, 
the receiving node specifies the maximum length of the message it is willing to 
receive. Normally, this is the same as the length specified by CMMD_send, but it 
may be either larger or smaller. 

For example, if the receiving node does not know the length of the message to
be sent to it, it can specify the maximum buffer length (or whatever shorter length
seems a reasonable maximum for the type of message expected) and accept as
many (or as few) bytes as the sender desires to send.

On the other hand, if the receiving node does not have room for the full message 
that the sender wishes to send, it can signal that it wishes to receive a shorter 
message. CMMD_send is constrained to send no more data than the receiver has 
signaled that it can accept. (Please note: This is an implementation-dependent 
constraint that may be lifted at some future release.) Thus, it sends either the 
amount it planned to send or the amount CMMD_receive allows, whichever is 
less. 

After sending whatever amount of data it is allowed to send, CMMD_send returns; 
it returns a value of 0 if it sent its entire message and a value of 1 if it sent a 
smaller amount. In the latter case, or in the case in which CMMD_receive
allocates a “maximum-length” buffer, the program should call CMMD_bytes_sent
to get the number of bytes actually sent.
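
The length agreement between sender and receiver, and the resulting return value, can be modeled as follows. This is a plain C sketch of the rule described above, not the library's implementation. The function model_send_result and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the length agreement described above.
   send_len:    bytes the sender wants to send.
   recv_len:    maximum bytes the receiver has said it will accept.
   *bytes_sent: receives the number of bytes actually transferred,
                which is the smaller of the two lengths.
   Returns 0 if the entire message was sent, 1 if less was sent. */
int model_send_result(int send_len, int recv_len, int *bytes_sent)
{
    assert(send_len >= 0 && recv_len >= 0 && bytes_sent != NULL);
    *bytes_sent = (send_len < recv_len) ? send_len : recv_len;
    return (*bytes_sent == send_len) ? 0 : 1;
}
```

With a 100-byte message, a receiver offering 4096 bytes yields a return value of 0 and 100 bytes sent, while a receiver offering 40 bytes yields a return value of 1 and 40 bytes sent.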


Standard Sends and Vector Sends 

A standard message, sent by CMMD_send, begins at the starting place identified
by the *buffer argument, and proceeds for len sequential bytes. A vector
message, sent by CMMD_send_v, takes a number of non-sequential elements from the
buffer, and sends those as a sequential message. (In other words, it performs an
implicit gather.)

Normally, the stride specified for CMMD_send_v will be larger than the element
length. This difference creates the vector send: elem_len bytes are put into the
message, then (stride - elem_len) bytes are skipped over, then the next elem_len
bytes are added to the message, and so on, until the specified number of elements
has been placed in the message to be sent. (See Figure 1.)

If the stride and element length are specified as being equal, the result is the same
as a non-vector send: (elem_len * elem_cnt) bytes are sent.

If the stride is smaller than the element length, CMMD_send_v sends elem_len
bytes starting at each stride. For example, a stride of 0 would result in the same
element being sent elem_cnt times.
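
The implicit gather can be modeled in plain C. The function below is a hypothetical stand-in that copies bytes the way the text describes a vector send assembling its message. It is a sketch for illustration and makes no claim about how the library itself is implemented.

```c
#include <assert.h>
#include <string.h>

/* Model of the implicit gather: copies elem_cnt elements of elem_len bytes
   each, taken every `stride` bytes from `buffer`, into the contiguous
   array `message`. Returns the message length in bytes. */
int gather_vector(const void *buffer, int elem_len, int stride,
                  int elem_cnt, void *message)
{
    const unsigned char *src = (const unsigned char *)buffer;
    unsigned char *dst = (unsigned char *)message;

    assert(elem_len >= 0 && stride >= 0 && elem_cnt >= 0);
    for (int i = 0; i < elem_cnt; i++)
        memcpy(dst + (size_t)i * elem_len,   /* next slot in the message   */
               src + (size_t)i * stride,     /* start of element i         */
               (size_t)elem_len);
    return elem_len * elem_cnt;
}
```

With elem_len equal to stride the copy is contiguous, as in a standard send. With a stride of 0 every element is read from the start of the buffer, so the same element appears elem_cnt times in the message.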

Note that you do not specify the total length of the message in a vector call.
Rather, the length is the result of multiplying the number of elements by the length of
each element. Note also that unless the element length and the stride are identical
(in which case you are using a vector call to do a standard send), the buffer itself
must be longer than the message to be sent from it, since its length must equal
the number of elements multiplied by the stride. Figure 1 illustrates stride,
element length, element count, message length, and buffer length for a vector send.


[Figure: a buffer drawn as a row of squares (1 square = 1 byte).]

For this send:  elem_len = 4
                stride   = 8
                elem_cnt = 2

Figure 1. A vector send.


As an example of regular and vector sends, let us consider the case of a 4 x 6
matrix A, filled with self-addresses from 0 to 23, in which each element is one
byte long, laid out in memory as follows:

     0   1   2   3   4   5
     6   7   8   9  10  11
    12  13  14  15  16  17
    18  19  20  21  22  23

To send the top row of the matrix as a message to node 5, you would use the call

CMMD_send (5, DEFAULT_MSG_TAG, &A, 6) 

To send the first column of the matrix to node 3, on the other hand, you would
need a vector send, stating that you were sending four elements (elem_cnt), each
one byte long (elem_len), located six bytes apart (stride).

CMMD_send_v (3, DEFAULT_MSG_TAG, &A, 1, 6, 4) 

Normally, the receiving node would accept the first message with a standard
receiving call (CMMD_receive) and the second with a vector receiving call
(CMMD_receive_v), thus preserving the original geometry of the data. They are
not, however, required to do so. Indeed, you could transpose this sample matrix
by sending each row as a sequential message, but having each received as a
six-element vector with a stride of 4.
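
The column send and the transposition trick can be checked with a plain C model of the 4 x 6 matrix. In the sketch below, gather_elems and scatter_elems are hypothetical helpers that stand in for the sending and receiving sides of a vector message; no CMMD calls are made, and the matrix is assumed to be stored row by row.

```c
#include <assert.h>
#include <string.h>

enum { ROWS = 4, COLS = 6 };

/* Sender side: gather elem_cnt elements of elem_len bytes, located
   `stride` bytes apart in `buf`, into the contiguous message `msg`. */
void gather_elems(const unsigned char *buf, int elem_len, int stride,
                  int elem_cnt, unsigned char *msg)
{
    assert(elem_len >= 0 && stride >= 0 && elem_cnt >= 0);
    for (int i = 0; i < elem_cnt; i++)
        memcpy(msg + i * elem_len, buf + i * stride, (size_t)elem_len);
}

/* Receiver side: scatter a contiguous message into elements located
   `stride` bytes apart in `buf`. */
void scatter_elems(const unsigned char *msg, int elem_len, int stride,
                   int elem_cnt, unsigned char *buf)
{
    assert(elem_len >= 0 && stride >= 0 && elem_cnt >= 0);
    for (int i = 0; i < elem_cnt; i++)
        memcpy(buf + i * stride, msg + i * elem_len, (size_t)elem_len);
}

/* First column of the row-major ROWS x COLS matrix `a`: four one-byte
   elements located six bytes apart. */
void first_column(const unsigned char *a, unsigned char *out)
{
    gather_elems(a, 1, COLS, ROWS, out);
}

/* Transposes `a` into the COLS x ROWS matrix `t`: each row is taken as a
   sequential message and stored as a six-element vector with stride 4. */
void transpose_by_rows(const unsigned char *a, unsigned char *t)
{
    for (int r = 0; r < ROWS; r++)
        scatter_elems(a + r * COLS, 1, ROWS, COLS, t + r);
}
```

In the transposition, the message for row r is stored starting at offset r of the destination, so that row r of the original becomes column r of the result.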


More about Vector Sends 

Vector sends, like standard sends, are constrained by the destination’s receive
request. A sending node offers to send (selem-count * selem-length) bytes; a
receive message agrees to accept (delem-count * delem-length). The smaller
number of the two is sent, in the following manner:

(1) Each element of the source is sent in its entirety until the appropriate
number of bytes sent is reached.

(2) If selem_len != delem_len, the source elements will be broken up and
distributed across the destination’s element length (not across its stride).

Note that this is in contrast to what some might expect. CMMD calls DO NOT send 
only as many bytes of each source element as will fit in each destination element. 
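
One way to picture this rule is to treat the transfer as a single byte stream that is read out of the source elements and written into the destination elements. The sketch below is a model under that assumption only; vector_transfer is a hypothetical name, not a CMMD routine, and the code makes no claim about how the library moves data internally.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a vector send matched with a vector receive.
   Transfers min(scnt * slen, dcnt * dlen) bytes and returns that count.
   Byte k of the stream comes from source element k / slen and lands in
   destination element k / dlen, so source elements are split across
   destination elements rather than truncated. */
size_t vector_transfer(const unsigned char *src, size_t slen, size_t sstride,
                       size_t scnt, unsigned char *dst, size_t dlen,
                       size_t dstride, size_t dcnt)
{
    assert(slen > 0 && dlen > 0);
    size_t offered = scnt * slen;
    size_t accepted = dcnt * dlen;
    size_t total = offered < accepted ? offered : accepted;

    for (size_t k = 0; k < total; k++) {
        size_t s = (k / slen) * sstride + (k % slen);
        size_t d = (k / dlen) * dstride + (k % dlen);
        dst[d] = src[s];
    }
    return total;
}
```

For instance, with selem_len = 5, sstride = 8, and two source elements, received as five elements with delem_len = 2 and dstride = 3, all ten bytes arrive: the first source element fills destination elements 0 and 1 and half of element 2.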
For example, if selem_len > delem_len

      selem_len = 5, sstride = 8,
      delem_len = 2, dstride = 3

the source buffer would contain

      [Diagram: source buffer.]

and the destination buffer (after the operation) would contain

      [Diagram: destination buffer.]

On the other hand, if selem_len < delem_len

      selem_len = 2, sstride = 5,
      delem_len = 3, dstride = 4

the source buffer would contain

      [Diagram: source buffer.]

and the destination buffer (after the operation) would contain

      [Diagram: destination buffer.]
4.2.2 Receiving Messages

CMMD_receive (int source, int tag, void *buffer, int len)

CMMD_receive_v (int source, int tag, void *buffer, int elem_len, int stride,
                int elem_cnt)

source        An integer identifying the node from which the message
              is to be sent (ANY_NODE allows any node to be the
              sender).

tag           An integer from 0 to 127, inclusive, which serves as a
              label for the message (ANY_TAG allows receipt of a
              message labeled with any tag).

*buffer       A pointer to a buffer that will contain the message to be
              received.

len           (Non-vector function only.) The length of the buffer, in
              bytes.

elem_len      (Vector functions only.) An integer specifying the length
              of each element in the vector, in bytes.

stride        (Vector functions only.) An integer specifying the
              distance in bytes between the starting addresses of the
              vector elements.

elem_cnt      (Vector functions only.) An integer specifying the
              number of elements in the vector.

CMMD_receive and CMMD_receive_v inform the source node that they are
ready to receive a message of len bytes with a specified tag; they then wait for
a message with the given tag to be sent from the given source. These routines can
take the special symbol ANY_NODE as the source argument, indicating that any
source is acceptable, and the symbol ANY_TAG as the tag argument, indicating
that any tag will be accepted.

If ANY_NODE is given, the program can call the function CMMD_msg_sender ()
to get the node identifier of the actual sender; if ANY_TAG is used,
CMMD_msg_tag () can be called to get the tag of the accepted message.

Once an acceptable message is sent, CMMD_receive and CMMD_receive_v
copy the message into the specified buffer. They return a value of 0 if the number
of bytes received equals len; otherwise they return 1, and
CMMD_bytes_received () can be called to get the number of bytes actually
received.


Standard Messages and Vector Messages 

All messages sent by CMMD calls are packed in sequential order. For many, this 
is the actual data ordering: CMMD_receive handles this type of message. 

Other messages, however, send data that is not to be considered sequential: an
array section would be one example. In this case, CMMD_receive_v is used, and
the call specifies that the information to be received is to be considered a vector
of e elements (elem_cnt), each m bytes long (elem_len), each element to be
placed in an area of the buffer that is n bytes long (stride).

The placement of the data in the buffer thus depends on the relationship between
stride and elem_len:

■ If stride > elem_len (the usual case), the elements will be placed in the
  specified buffer at intervals, each separated by (n minus m) bytes.

■ If stride = elem_len, then the elements are placed sequentially in the
  buffer, as for a standard receive.

■ If stride < elem_len, subsequent elements overwrite previous ones where
  they overlap.
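
The three cases can be checked with a small model of the receiving side. This is a sketch only: unpack_vector is a hypothetical helper, and it assumes that elements are written into the buffer in ascending element order, which is what makes later elements overwrite earlier ones when stride < elem_len.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: scatter a contiguous message into buf as
   elem_cnt elements of elem_len bytes, with element i starting at
   byte offset i * stride. Elements are written in ascending order. */
void unpack_vector(const unsigned char *msg, unsigned char *buf,
                   size_t elem_len, size_t stride, size_t elem_cnt)
{
    assert(elem_len > 0);
    for (size_t i = 0; i < elem_cnt; i++)
        memcpy(buf + i * stride, msg + i * elem_len, elem_len);
}
```

Scattering the six-byte message "ABCDEF" as three 2-byte elements into a buffer of dots gives "AB.CD.EF" for a stride of 3, "ABCDEF.." for a stride of 2, and "ACEF...." for a stride of 1, where each element has overwritten the second byte of the one before it.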

CMMD_send_v and CMMD_receive_v are frequently paired, so that data is
received in the same geometry from which it was sent. It is possible, however,
to receive data in a geometry different from that in which it was sent: for instance,
sequential data may be broken into a vector (thus “scattering” the data), or a
vector received as sequential (thus “gathering” it). Figure 2 illustrates these four
possible patterns.

[Figure 2. Sending and receiving data.]
4.3 Simultaneous Sends and Receives

4.3.1 In Any Pattern

CMMD_send_and_receive (int source, int source_tag, void *inbuffer, int inlen,
                       int destination, int dest_tag, void *outbuffer,
                       int outlen)

CMMD_send_and_receive_v (int source, int source_tag, void *inbuffer,
                         int in_elem_len, int in_stride, int in_elem_count,
                         int destination, int dest_tag, void *outbuffer,
                         int out_elem_len, int out_stride,
                         int out_elem_count)

source            An integer identifying the node from which a message
                  will be received by this node.

source_tag        An integer, 0-127 inclusive, or ANY_TAG, labeling the
                  message to be received.

*inbuffer         Pointer to the buffer that will contain the message to be
                  received.

inlen             (Non-vector functions only.) Length, in bytes, of the
                  buffer to hold the message received by this node.

in_elem_len       (Vector functions only.) Length, in bytes, of each element
                  in the vector to be received by this node.

in_stride         (Vector functions only.) Number of bytes between starting
                  addresses of elements in the vector that comprises the
                  message to be received by this node.

in_elem_count     (Vector functions only.) Number of elements that
                  comprise the vector to be received by this node.

destination       An integer identifying the node to which this node will
                  send a message.

dest_tag          An integer, 0-127 inclusive, labeling the message that
                  will be sent by this node.

*outbuffer        A pointer to the buffer holding the message to be sent by
                  this node.

outlen            (Non-vector functions only.) Length, in bytes, of the
                  buffer to be sent by this node.

out_elem_len      (Vector functions only.) Length, in bytes, of each element
                  in the vector to be sent by this node.

out_stride        (Vector functions only.) Number of bytes by which
                  starting addresses of elements in the vector to be sent are
                  separated.

out_elem_count    (Vector functions only.) Number of elements that
                  comprise the vector to be sent by this node.

These two functions allow nodes to send and receive messages simultaneously.
The routines can be used to perform common grid communication, or to send and
receive in more irregular patterns. Any number of nodes can take part in one of
these calls; the only requirement is that each node must both send a message and
receive a message. (See CMMD_swap and CMMD_swap_v for a simpler way to
send and receive simultaneously when two nodes are involved, each serving as
both source and destination for the other.)

The functions cause the message in the calling node’s outbuffer to be passed to
the destination node at the same time that a message is read into the calling
node’s inbuffer from the source node. The buffers may overlap.

CMMD_send_and_receive and CMMD_send_and_receive_v do not return
until the calling node has sent one message and received one. They return TRUE
if the number of bytes received equals inlen and the number of bytes sent equals
outlen; otherwise they return FALSE, and CMMD_bytes_received () and
CMMD_bytes_sent () can be called to get the number of bytes received and
sent, respectively.

CMMD_send_and_receive handles sequential data, while
CMMD_send_and_receive_v exhibits gather/scatter behavior.


4.3.2 Further Notes 

(1) The strides for sent and received messages do not have to be equal. For
    example, to perform a transpose in which the sends are vectored and the
    receives are sequential, set out_stride as needed for the sends and set
    in_stride equal to in_elem_len for the receives. (For more information on
    vector messages, see the entry for CMMD_send.)

(2) The send_and_receive functions should be used when a program needs
    to perform circular shifts on an array. Each node sends in one direction
    and receives from another direction, as in the example diagrammed in
    Figure 3 below.

    [Figure 3. A circular shift on 4 nodes.]

(3) Sends and receives may be mixed with send_and_receive functions. For
    example, you might mix these calls in order to create an end-off shift on
    four nodes:

    Node 0: CMMD_send: uses boundary value, sends to node 1
    Node 1: CMMD_send_and_receive: receives from 0, sends to 2
    Node 2: CMMD_send_and_receive: receives from 1, sends to 3
    Node 3: CMMD_receive: receives from 2
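
The difference between the two shifts can be modeled on a single machine by letting array index i stand for node i. The sketch below illustrates the data movement only; circular_shift and end_off_shift are hypothetical names, and no CMMD calls are made.

```c
#include <assert.h>

#define NODES 4

/* Circular shift: every node sends to its upper neighbor and receives
   from its lower neighbor, with node NODES-1 wrapping around to node 0.
   On the CM-5 each node would make one send_and_receive call. */
void circular_shift(const int in[NODES], int out[NODES])
{
    assert(in != out);            /* the model needs separate buffers */
    for (int node = 0; node < NODES; node++) {
        int dest = (node + 1) % NODES;
        out[dest] = in[node];
    }
}

/* End-off shift: node 0 only sends and takes the boundary value,
   the interior nodes send and receive, and the last node only
   receives, so its old value is shifted off the end. */
void end_off_shift(const int in[NODES], int boundary, int out[NODES])
{
    assert(in != out);
    out[0] = boundary;
    for (int node = 0; node + 1 < NODES; node++)
        out[node + 1] = in[node];
}
```

In the circular case the value from node 3 reappears on node 0; in the end-off case it is discarded and node 0 holds the boundary value instead.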


4.3.3 Swaps: An Exchange between Two Nodes Only 

CMMD_swap (int processor, void *inbuffer, int inlen, void *outbuffer, int outlen)

CMMD_swap_v (int processor, void *inbuffer, int in_elem_len, int in_stride,
             int in_elem_count, void *outbuffer, int out_elem_len,
             int out_stride, int out_elem_count)

processor         An integer identifying the node with which a message is
                  to be swapped.

*inbuffer         A pointer to the buffer that will hold the received
                  message.

inlen             (CMMD_swap only.) Length, in bytes, of the buffer that
                  will hold the message received by this node.

in_elem_len       (CMMD_swap_v only.) Length, in bytes, of each
                  element in the vector to be received by this node.

in_stride         (CMMD_swap_v only.) Number of bytes between
                  starting addresses of elements in the vector to be received by
                  this node.

in_elem_count     (CMMD_swap_v only.) Number of elements that
                  comprise the vector to be received by this node.

*outbuffer        A pointer to the buffer holding the message to be sent by
                  this node.

outlen            (CMMD_swap only.) Length, in bytes, of the buffer to be
                  sent by this node.

out_elem_len      (CMMD_swap_v only.) Length, in bytes, of each element
                  in the vector to be sent by this node.

out_stride        (CMMD_swap_v only.) Number of bytes by which
                  starting addresses of elements in the vector to be sent are
                  separated.

out_elem_count    (CMMD_swap_v only.) Number of elements that
                  comprise the vector to be sent by this node.


CMMD_swap is identical to CMMD_send_and_receive (and CMMD_swap_v to
CMMD_send_and_receive_v) where the source node equals the destination
node.


For an explanation of sequential versus vector routines, see CMMD_send.
CMMD_wait_for_send (int destination)

destination       An integer identifying the node to which the message is
                  to be sent.

CMMD_wait_for_send checks to see whether a prior short message from this
node to the specified destination node is outstanding (not yet received). If such
a message exists, the function waits until that message has been received (e.g.,
by CMMD_receive). If destination is ANY_NODE, the function waits until all
previous messages from this node to any destination have been received.

Before sending a message to the specified node (n), CMMD_send_short(n,...)
automatically calls CMMD_wait_for_send(n), thus ensuring that a second send
to node n does not occur until the first has been received. A program would call
CMMD_wait_for_send explicitly if the programmer wanted to ensure that a
message to one node was received before a message to another node was sent,
but did not want to wait immediately after the first send. The pattern might be

      send_short to node n
      do some other work
      call wait_for_send on node n
      send_short to node m
Polling



Message-passing programs need some way of identifying whether, at any given
time, there are messages that are either in transit or waiting to be sent. To identify
such messages, a program polls for them.

A process on any individual node may call CMMD_msg_pending () to poll for
a message, and issue a message-receive call only after it knows that a message
is waiting to be sent. This allows the process to avoid having to block while
waiting for a message. (A receiving process that relies on polling but polls
infrequently may, of course, cause sending processes to block while waiting for
the receiver.)


5.1 Polling Function

CMMD_msg_pending (int node, int tag)

node        Integer identifying a node. (May be ANY_NODE.)

tag         Integer identifying a tag. (May be ANY_TAG.)

CMMD_msg_pending returns TRUE if there is a message waiting to be received
from the specified node (or from any node if ANY_NODE is supplied as the node
argument) with tag tag (or any tag if ANY_TAG is supplied as the tag argument).
It returns FALSE otherwise.

If ANY_NODE is used, the function CMMD_msg_sender () can then be called to
get the node identifier of the pending sender; if ANY_TAG is used, CMMD_msg_tag ()
will return the tag of the pending message.
Auxiliary Routines



These are the routines that tell you what really happened when you sent that
message from one node to another: How much was sent or received? By what node
was it sent? How was it tagged? Although their obvious uses are as responses to
return values of 1 (signifying incomplete transmission or reception) or to the
reception of messages sent from ANY_NODE or labeled with ANY_TAG, these
informational routines can be called at any point during a program.

CMMD_bytes_received ()    Returns the number of bytes received by this
                          node in its most recent message.

CMMD_bytes_sent ()        Returns the number of bytes sent in the last
                          message.

CMMD_msg_sender ()        Returns the node identifier for the last message
                          received except when issued following a call to
                          CMMD_msg_pending. In that case:

                          ■ If the call to CMMD_msg_pending
                            returned TRUE, CMMD_msg_sender
                            returns the identifier of the node that is
                            waiting to send a message.

                          ■ If the call to CMMD_msg_pending
                            returned FALSE, calling
                            CMMD_msg_sender causes an error.

CMMD_msg_tag ()           Returns the tag of the last message received
                          except when issued following a call to
                          CMMD_msg_pending. In that case:

                          ■ If the call to CMMD_msg_pending
                            returned TRUE, CMMD_msg_tag returns
                            the tag of the message that is waiting to
                            be received.

                          ■ If the call to CMMD_msg_pending
                            returned FALSE, calling
                            CMMD_msg_tag causes an error.
Broadcasts


Broadcasts are messages sent from the host to all nodes. Two kinds exist: The 
host may broadcast the entire contents of the buffer to all nodes (in which case 
all receive identical data) or it may parcel out elements from the buffer among 
all nodes, one element per node. 

All nodes receive data simultaneously, and all receive the same amount of data. 
For this reason, it is very important to ensure that all nodes have sufficient buffer 
space to hold the broadcast message. 

The host and all the nodes must take part in these broadcasts. Once a broadcast 
is signaled, either by the host or by any node, the hardware begins checking for 
responses. Only when the hardware signals that the entire broadcast is complete 
can any of the broadcast calls return. 


7.1 Broadcasting the Entire Buffer to All Nodes 

CMMD_bc_from_host (void *buffer, int len)

CMMD_receive_bc_from_host (void *buffer, int len)

*buffer     A pointer to a buffer that holds the message being
            broadcast and received.

len         The length, in bytes, of the buffer being broadcast and
            received.

The host process calls CMMD_bc_from_host to broadcast a buffer of the
specified length (in bytes) to all nodes. All nodes must call
CMMD_receive_bc_from_host, with the same length argument, to receive the buffer.


PLEASE NOTE 

If length arguments are not identical across all nodes, a segmentation
fault may result.

Please note also that all processors within the partition must
take part in this operation. If a given program divides the partition
into sections, an attempt to use global operations within a
section will fail.


These functions do not return until the broadcast is complete; that is, until the 
host and all the nodes have made their calls. 


7.2 Distributing a Buffer among the Nodes 

CMMD_distrib_to_nodes (void *buffer, int elem_length)

CMMD_receive_element_from_host (void *buffer, int length)

*buffer         A pointer to the buffer that holds the messages being sent
                and received.

                For the host, the length of the buffer (in bytes) must be at
                least (CMMD_partition_size () * elem_length).

                For a node, the length must be at least elem_length.

elem_length     The length (in bytes) of each element to be sent.

length          The length (in bytes) of the buffer that is to receive the
                element being sent.

The host process calls CMMD_distrib_to_nodes in order to distribute
elements of the given length from the specified buffer to each node in processor
order. The length (in bytes) of the buffer on the host must be at least
(CMMD_partition_size () * elem_length). Only the first
(CMMD_partition_size () * elem_length) bytes are sent; each node receives one element.

In response to the host call, all nodes must call
CMMD_receive_element_from_host, specifying a buffer of the appropriate size to
receive the element.

Neither the host call nor any of the node calls return until all have been made and
completed.
Global Synchronization


Global synchronization functions, as their name implies, serve to synchronize all
nodes (and optionally the host as well) at a given point in a program. Three
versions are provided:

CMMD_sync_host_with_nodes    This pair of calls serves as a synchronization
CMMD_sync_with_host          point for host and nodes together.

CMMD_sync_with_nodes         This call, made by all nodes, allows them
                             to synchronize themselves without the
                             host’s participation.

CMMD_barrier_sync            This call, made only by the host, synchronizes
                             host and nodes at the completion of
                             all currently executing node functions.

All processors in the partition must join in these calls. Once the host or any node
has begun one of these synchronization calls, the CM hardware keeps track of
responses, and allows none of the calls to return until all nodes (and the host,
when needed) have made their call.

In addition to these synchronous routines, two asynchronous global OR routines
allow host and all nodes to signal to each other by contributing to a global OR and
reading its results.

CMMD_set_global_or           Made by host and all nodes, this call
                             contributes a value (0 or nonzero) toward the
                             creation of a global OR.

CMMD_get_global_or           Made by host or any node, this call reads
                             the current value of the global OR.
By using the CMMD_set_global_or function, each processor contributes to the
global OR at an appropriate time; the hardware checks and updates the global
value at frequent intervals; and individual processors read the value when
desired. Thus, the global OR mechanism can be used as a non-blocking method of
determining when all processors have reached a given state. All processors
would start a task, for instance, by sending a 1. As each finished its share of the
task, it would send a 0. By checking the value of the global OR (which would
change to 0 only when all processors had finished), a processor could determine
whether the whole task was complete and thus select its own next action.
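
The completion-detection idea can be sketched as a single-process model in C. Everything here is hypothetical scaffolding rather than CMMD code: set_or and get_or stand in for CMMD_set_global_or and CMMD_get_global_or, PROCS is an assumed partition size, and the model ignores the propagation delay of the real network.

```c
#include <assert.h>
#include <stdbool.h>

#define PROCS 4                    /* assumed number of processors */

static int contribution[PROCS];    /* latest value set by each processor */

/* Model of one processor contributing 0 or nonzero to the global OR. */
void set_or(int proc, int value)
{
    assert(proc >= 0 && proc < PROCS);
    contribution[proc] = value;
}

/* Model of reading the global OR: 1 if any contribution is nonzero. */
int get_or(void)
{
    int result = 0;
    for (int p = 0; p < PROCS; p++)
        result |= (contribution[p] != 0);
    return result;
}

/* The whole task is complete only when every processor has set 0. */
bool all_done(void)
{
    return get_or() == 0;
}
```

A processor that finds all_done() false is free to do other work and check again later, which is the non-blocking behavior described above.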

Please note: These asynchronous global OR functions should not be confused 
with the synchronous global-OR reduction operation, which is explained later, in 
the section on Scans, Reductions, and Concatenation. 


8.1 Global Synchronization Functions 

CMMD_sync_host_with_nodes ()

CMMD_sync_with_host ()

The host calls CMMD_sync_host_with_nodes to synchronize itself and all the
nodes. The nodes respond by calling CMMD_sync_with_host. These calls
return only after the host and all nodes have made the call.


CMMD_sync_with_nodes ()

CMMD_sync_with_nodes synchronizes the calling node with all other nodes.
Once one node has made this call, all nodes must; the function does not return
until they do. (Note that this routine does not involve the host.)
CMMD_barrier_sync ()

A program running on the nodes of a CM-5 system alternates between two states:
It can be executing a procedure, or it can be in the dispatch loop, waiting for the
host to initiate execution of a procedure. (For more information about this
execution process, see Chapter 2 of the CMMD User’s Guide.)

The CMMD_barrier_sync function is called by the host only. It synchronizes 
the host with the completion of all previously called node procedures. It returns 
only when all nodes have finished execution and have returned to the dispatch 
loop. 


PLEASE NOTE 

All host-node communication for all nodes in a given program 
block must be complete before the host processor makes this 
call. If the call is made while any communication between host 
and node is pending, the program will hang. 


CMMD_set_global_or (int value)

value       An integer, either 0 or nonzero.

Callable on any processor (host or node), CMMD_set_global_or allows a
processor to contribute a value (either 0 or some nonzero integer) to a global OR
function, that is, an OR in which host and all nodes may take part. The function
returns when the value has been sent; it does not wait for participation by any
other processor.


CMMD_get_global_or ()

Callable on either the host or the nodes, CMMD_get_global_or returns the
current value of a global OR function over all processors, host and nodes alike.

This function is asynchronous; it requires participation by no other processors.

If CMMD_set_global_or has not already set a value for the global OR, calls to
CMMD_get_global_or return unpredictable results.

As contributions to this global OR may be asynchronous, the hardware checks the
value at frequent intervals and updates it as needed. Note, however, that some
network delay exists during reception and propagation of values; thus, there is a
small but real window between the time at which a processor sends a
set_global_or message and the time by which that message can affect the
result of another node’s get_global_or request.
Chapter 9

Scan, Reduction, and Concatenation
Operations


Scans, reductions, and concatenation are global operations. Given a buffer
containing some value in each node, these global computations operate cumulatively
on the buffer set to perform such tasks as

■ summing the value across all the nodes

■ finding the largest or smallest value

■ performing a bitwise AND, OR, or XOR

For reduction operations, the final value can be returned either to all the nodes
or to the host. For scans, the cumulative results are returned as a running tally
across all the nodes.

All nodes within the partition must take part in these calls. If the result is to be
returned to the host, then the host must also take part.

These global functions impose synchrony: those involving both host and nodes
do not return until host and all nodes have made their (different) calls; those
involving only nodes do not return until all nodes have made the call.

Each scan and reduction function comes in four versions: one for integer, one for
unsigned integer, and one each for single- and double-precision floating-point
numbers. Each version requires as input a value of the type specified in its name,
and returns a value of the same type. Exceptions to this rule are the float routines,
which take float arguments but return double results.

Because scan and reduction functions may perform one of a number of
operations, they take as an argument one of the following symbols representing the
operation to be performed.
CMMD_combiner_add      Add operations.

CMMD_combiner_uadd     Add operations (unsigned).

CMMD_combiner_max      Return the largest value found.

CMMD_combiner_umax     Return the largest value found (unsigned).

CMMD_combiner_min      Return the smallest value found.

CMMD_combiner_umin     Return the smallest value found (unsigned).

CMMD_combiner_ior      Inclusive OR operation.

CMMD_combiner_xor      Exclusive OR operation.

CMMD_combiner_and      Logical AND operation.


Thus, for example, a CMMD_reduce_int function, called using the
CMMD_combiner_max argument, would compare the values on all nodes and return the
largest value to all nodes, while a call to CMMD_reduce_to_host with the
CMMD_combiner_add argument would add the values from all nodes and return
the sum of the values to the host.


9.1 Reductions, Scans, and Segmented Scans 

Reduction operations, scans, and segmented scans provide three basic methods 
of all-to-all and all-to-one communication. (See Figure 4.) 


Reductions 

A reduce operation starts with values in every processor and ends with a single 
value, either in every node or in every node plus the host processor. Values may 
be added, so that the sum of all values is returned; or the largest or smallest value 
may be chosen; or an OR or XOR may be done across all the values. In each case, 
one final result is returned. 

Thus, on four nodes holding the values 

4 9 7 6 

a reduce/add would return 

26 26 26 26 
[Figure 4. Global summation operations.]


Scans

A scan (sometimes called a parallel prefix operation) moves from processor to
processor, in processor identifier order, creating a running tally of results in each
processor. The function call specifies whether the scan proceeds upward (0 to n)
or downward (n to 0), and whether the scan is to be inclusive or exclusive. (In
an inclusive scan, the source value contained in any given node contributes to
the result for that node; in an exclusive scan, it does not.)

Thus, with our same four values,

4 9 7 6

an upward exclusive scan/add would produce

0 4 13 20

while a downward inclusive scan/add would produce

26 22 13 6
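
These results can be reproduced with a sequential model in which array index i stands for node i. The sketch is an illustration of the arithmetic only; scan_add and reduce_add are hypothetical names, and the real library computes these values in parallel across the nodes.

```c
#include <assert.h>

/* Hypothetical model of an add-scan over n node values.
   upward    nonzero: visit nodes 0..n-1; zero: visit n-1..0.
   inclusive nonzero: a node's own value counts toward its result. */
void scan_add(const int *value, int n, int upward, int inclusive, int *result)
{
    assert(n > 0);
    int running = 0;
    for (int k = 0; k < n; k++) {
        int node = upward ? k : n - 1 - k;
        if (inclusive) {
            running += value[node];
            result[node] = running;
        } else {
            result[node] = running;
            running += value[node];
        }
    }
}

/* Hypothetical model of a reduce/add: every node gets the same sum. */
int reduce_add(const int *value, int n)
{
    assert(n > 0);
    int sum = 0;
    for (int node = 0; node < n; node++)
        sum += value[node];
    return sum;
}
```

For the values 4 9 7 6, the upward exclusive call yields 0 4 13 20, the downward inclusive call yields 26 22 13 6, and the reduction yields 26. In an inclusive scan, the last node visited always holds the same value a reduction would return.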


Segmented Scans 

In a segmented scan, independent scans are run simultaneously on different
subgroups (or segments) of the nodes. The segments are determined at run time by
an argument called the sbit (described later in this chapter). For example, given
our four values:

4 9 7 6

and sbit values of

1 0 1 0

an upward inclusive segmented scan/add would return

4 13 7 13
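
The segmented case can be modeled the same way. This sketch assumes the simplest reading of the example, in which a nonzero sbit marks the first node of a new segment; it does not model the difference between the segment-bit and start-bit modes, and segmented_scan_add is a hypothetical name.

```c
#include <assert.h>

/* Hypothetical model of an upward inclusive segmented add-scan.
   A nonzero sbit[node] is taken to start a new segment at that node,
   so the running total is reset before the node's value is added. */
void segmented_scan_add(const int *value, const int *sbit, int n, int *result)
{
    assert(n > 0);
    int running = 0;
    for (int node = 0; node < n; node++) {
        if (sbit[node] != 0)
            running = 0;
        running += value[node];
        result[node] = running;
    }
}
```

With values 4 9 7 6 and sbit values 1 0 1 0, nodes 0 and 1 form one segment and nodes 2 and 3 another, giving 4 13 7 13.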


9.2 Concatenation 

Concatenation simply appends the value from each processor to the values of all
preceding processors (in processor identifier order). CMMD provides two
versions of concatenation: One concatenates across the nodes only, and writes the
resulting value into a buffer on every node. The other concatenates values from
every node into a buffer on the host. Concatenation always proceeds from the
lowest to the highest node; it is never segmented.
9.3 Reduction Operations

CMMD_reduce_int (int value, CMMD_combiner_t combiner)
CMMD_reduce_uint (unsigned value, CMMD_combiner_t combiner)
CMMD_reduce_float (float value, CMMD_combiner_t combiner)
CMMD_reduce_double (double value, CMMD_combiner_t combiner)

value       The value to be contributed to the operation. Its type must
            match that specified by the function name.

combiner    One of the symbols listed below, specifying the type of
            operation to be performed.

            For signed integer operands (CMMD_reduce_int),
            allowable combiners are

            CMMD_combiner_add      CMMD_combiner_ior
            CMMD_combiner_max      CMMD_combiner_xor
            CMMD_combiner_min      CMMD_combiner_and

            For unsigned integer operands (CMMD_reduce_uint),
            allowable combiners are

            CMMD_combiner_uadd     CMMD_combiner_ior
            CMMD_combiner_umax     CMMD_combiner_xor
            CMMD_combiner_umin     CMMD_combiner_and

            For float and double operands (CMMD_reduce_float
            and CMMD_reduce_double), allowable combiners are

            CMMD_combiner_add
            CMMD_combiner_max
            CMMD_combiner_min

            Using anything other than an allowable combiner causes a
            fatal error.

The reduce functions return the value of the specified reduce operation over all
the nodes. Every node thus receives the same return value. The functions will not
return until all nodes have called CMMD_reduce_type. The host processor is not
involved. To involve the host processor, use the pair of routines described below,
CMMD_reduce_from_nodes and CMMD_reduce_to_host. Note that these
routines must be paired; it is an error to call CMMD_reduce on the nodes and
CMMD_reduce_from_nodes on the host.
CMMD_reduce_from_nodes_int (int value, CMMD_combiner_t combiner)
CMMD_reduce_from_nodes_uint (unsigned value, CMMD_combiner_t combiner)
CMMD_reduce_from_nodes_float (float value, CMMD_combiner_t combiner)
CMMD_reduce_from_nodes_double (double value, CMMD_combiner_t combiner)

CMMD_reduce_to_host_int (int value, CMMD_combiner_t combiner)
CMMD_reduce_to_host_uint (unsigned value, CMMD_combiner_t combiner)
CMMD_reduce_to_host_float (float value, CMMD_combiner_t combiner)
CMMD_reduce_to_host_double (double value, CMMD_combiner_t combiner)

value       The value to be contributed by this processor to the
            operation. Its type must match that specified by the function
            name.

combiner    One of the symbols listed below, specifying the type of
            operation to be performed.

            For signed integer operands, allowable combiners are

            CMMD_combiner_add      CMMD_combiner_ior
            CMMD_combiner_max      CMMD_combiner_xor
            CMMD_combiner_min      CMMD_combiner_and

            For unsigned integer operands, allowable combiners are

            CMMD_combiner_uadd     CMMD_combiner_ior
            CMMD_combiner_umax     CMMD_combiner_xor
            CMMD_combiner_umin     CMMD_combiner_and

            For float and double operands, allowable combiners are

            CMMD_combiner_add
            CMMD_combiner_max
            CMMD_combiner_min

            Using anything other than an allowable combiner causes a
            fatal error.


In this pair of functions, the host calls CMMD_reduce_from_nodes_type and
all nodes call CMMD_reduce_to_host_type. The functions return to the host
processor and to each node the value of the specified reduce operation over all
the nodes including the host processor. The functions will not return until all
nodes have called CMMD_reduce_to_host_type and the host has called
CMMD_reduce_from_nodes_type.


9.4 Scan Operations 

CMMD_s can_int (int value, CMMD combiner J combiner, CMMD_scan_ 

direction J direction, CMMD_segment_mode_t smode, 
int sbit, CMMD_scan_inclusionJ inclusion) 

CMMD_acan_uint (uint value, CMMD_combinerJ combiner, CMMD_ 

scanjlirectionj direction, CMMDjsegmentjnodet 
smode, int sbit, CMMD_scanJnclusionJ inclusion) 

CMfD_scan_f loat (float value, CMMDjombinerJ combiner, CMMD_ 
scanjlirectionj direction, CMMD_segment_ modej 
smode, int sbit, CMMD_scan inclusion t inclusion) 

CMMD_scan_double {double value, CMMD combiner t combiner, CMMD_ 
scan_ direction J direction, CMMD_segmentjnodeJ 
smode, int sbit, CMMD_scanJnclusionJ inclusion) 

value         The value to be contributed by this processor to the
              operation. Its type must match that specified by the
              function name.

combiner      One of the symbols listed below, specifying the type of
              operation to be performed.

              For signed integer operands (CMMD_scan_int), allowable
              combiners are

              CMMD_combiner_add     CMMD_combiner_ior
              CMMD_combiner_max     CMMD_combiner_xor
              CMMD_combiner_min     CMMD_combiner_and

              For unsigned integer operands (CMMD_scan_uint),
              allowable combiners are

              CMMD_combiner_uadd    CMMD_combiner_ior
              CMMD_combiner_umax    CMMD_combiner_xor
              CMMD_combiner_umin    CMMD_combiner_and

              For float and double operands (CMMD_scan_float and
              CMMD_scan_double), allowable combiners are

              CMMD_combiner_add
              CMMD_combiner_max
              CMMD_combiner_min

              Using anything other than an allowable combiner causes a
              fatal error.

direction     CMMD_upward
              The scan starts at node 0 and proceeds to the
              highest-numbered node.

              CMMD_downward
              The scan starts at the highest-numbered node and
              proceeds downward to node 0.

smode         CMMD_none
              The scan proceeds across all nodes.

              CMMD_segment_bit
              The scan is a segmented scan, with sbit acting as a
              segment bit.

              CMMD_start_bit
              The scan is a segmented scan, with sbit acting as a
              start bit.

sbit          If sbit is nonzero, the node marks the boundary (usually
              the beginning) of a segment; if sbit is zero, the node is
              not a boundary marker. (If smode is CMMD_none, then sbit
              is ignored.)

inclusion     CMMD_inclusive
              The scan is inclusive.

              CMMD_exclusive
              The scan is exclusive.

CMMD_scan_type returns the value of the specified scan operation over all the
nodes. This function does not return until all nodes have called the function. The
host processor is not involved.

PLEASE NOTE

(1) Values for direction, smode, and inclusion MUST be identical
across all nodes. Otherwise, results are unpredictable and
the program may crash.

(2) For CMMD_scan_float and CMMD_scan_double, the
combination of smode = CMMD_start_bit and inclusion =
CMMD_exclusive is currently illegal and will cause the nodes
to exit.


Direction and inclusion 

The direction argument determines the direction of the scan, either upward
(from node 0 to the highest-numbered node) or downward (from the
highest-numbered node to node 0). The inclusion argument determines whether a
given node's own source value is included in the result that node receives.
When smode is CMMD_none, these two arguments alone define which source values
affect the destination value in a given processor.

■ In inclusive upward scans, the value returned for a given node n is the 
combination of the source values in all nodes <= n. 

■ In inclusive downward scans, the value returned for a given node n is the 
combination of the source values in all nodes >= n. 

■ In exclusive upward scans, the value returned for a given node n is the 
combination of source values in all nodes < n. The first (lowest-numbered) 
node receives the identity value for the combiner. 

■ In exclusive downward scans, the value returned for a given node n is the 
combination of the source values in all nodes > n. The highest-numbered
node receives the identity value for the combiner. 

If a scan is a segmented scan, these rules apply on a per-segment basis, as
explained below.
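For the unsegmented case (smode is CMMD_none), the four rules above can be modelled with a single loop that visits the nodes in scan order. The sketch below is an off-machine simulation, not CMMD code: the sim_ names are hypothetical stand-ins for the CMMD direction and inclusion symbols, and the add combiner (identity 0) is chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for CMMD_upward / CMMD_downward and
   CMMD_inclusive / CMMD_exclusive. */
typedef enum { SIM_UPWARD, SIM_DOWNWARD } sim_direction_t;
typedef enum { SIM_INCLUSIVE, SIM_EXCLUSIVE } sim_inclusion_t;

/* Unsegmented add-scan over n simulated nodes.
   in[i] is node i's contribution; out[i] is the value node i receives.
   in and out must not overlap. */
void sim_scan_add(const int *in, int *out, size_t n,
                  sim_direction_t dir, sim_inclusion_t inc)
{
    size_t step;
    int acc = 0;                        /* identity value for add */

    assert(in != NULL && out != NULL);
    for (step = 0; step < n; step++) {
        /* Scan order: 0..n-1 when upward, n-1..0 when downward. */
        size_t node = (dir == SIM_UPWARD) ? step : n - 1 - step;

        if (inc == SIM_EXCLUSIVE)
            out[node] = acc;            /* excludes the node's own value */
        acc += in[node];
        if (inc == SIM_INCLUSIVE)
            out[node] = acc;            /* includes the node's own value */
    }
}
```

With contributions 4, 1, 5, 2 on nodes 0 through 3, an upward inclusive scan gives 4, 5, 10, 12, and a downward exclusive scan gives 8, 7, 2, 0. The first node visited in an exclusive scan receives the identity value.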




Smode and Sbit 

The smode and sbit arguments define segmented scans. These are scans that tally 
their results across subgroups of the nodes. Every node belongs to one group, or 
“segment,” with the group to which it belongs determined by smode and sbit as 
follows: 

■ When smode is CMMD_segment_bit 

If smode is CMMD_segment_bit, then sbit is considered a segment bit. A
nonzero segment bit starts a new segment for an upward scan, but ends a
segment for a downward scan. Imagine 8 nodes with the following segment
bits:

0 0 1 0 0 1 0 0

Both upward and downward scans would have 3 segments: one would include
nodes 0 and 1, another would include nodes 2-4, and a third would
include nodes 5-7.

When sbit is a segment bit, operations in one segment never affect the
values of elements in another segment. Thus, given segment bits of

0 0 1 0

and values of

4 1 5 2

an upward exclusive max would produce

0 4 0 5

(See Figure 5.) 

■ When smode is CMMD_start_bit

If smode is CMMD_start_bit, then sbit is considered a start bit. A
nonzero start bit always starts a new segment, whether the direction of
the scan is upward or downward. Thus, given 8 nodes with the following
start bits:

0 0 1 0 0 1 0 0

an upward scan would have the same segments as the segmented scan
shown above (0-1, 2-4, 5-7); but a downward scan would have segments
of 7-6, 5-3, and 2-0.

In addition, if the operation is exclusive, a node with a nonzero start bit
does receive a value from the preceding segment. The value received is the
reduce of the previous segment, with the same combiner. Thus, given start
bits of

0 0 1 0

and values of

4 1 5 2

an upward exclusive scan/max would produce

0 4 4 5

(See Figure 5.) 



                         Segment 0     Segment 1

Source values             4  |  1       5  |  2

Segment-bit result        0  |  4       0  |  5

Start-bit result          0  |  4       4  |  5


Figure 5. Upward exclusive scans with max combiners.
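The segment rules can be checked off the machine with a small simulation. The sketch below is not CMMD code: the sim_ names are hypothetical stand-ins for the CMMD symbols, and unsigned values are used so that 0 serves as the identity for max, matching the examples in the text. It assumes that an inclusive start-bit scan simply restarts at each start bit; the text states the carry-over rule only for exclusive scans.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the CMMD direction, segment-mode, and
   inclusion symbols; the real ones come from the CMMD headers. */
typedef enum { SIM_UPWARD, SIM_DOWNWARD } sim_direction_t;
typedef enum { SIM_NONE, SIM_SEGMENT_BIT, SIM_START_BIT } sim_smode_t;
typedef enum { SIM_INCLUSIVE, SIM_EXCLUSIVE } sim_inclusion_t;

/* Model of a (possibly segmented) max-scan over n simulated nodes.
   in[i] and sbit[i] are node i's arguments; out[i] is node i's result.
   in and out must not overlap. sbit is ignored when smode is SIM_NONE. */
void sim_scan_umax(const unsigned int *in, const int *sbit,
                   unsigned int *out, size_t n, sim_direction_t dir,
                   sim_smode_t smode, sim_inclusion_t inc)
{
    size_t step;
    unsigned int acc = 0u;              /* running max of current segment */

    assert(in != NULL && out != NULL);
    assert(smode == SIM_NONE || sbit != NULL);
    for (step = 0; step < n; step++) {
        /* Scan order: 0..n-1 when upward, n-1..0 when downward. */
        size_t node = (dir == SIM_UPWARD) ? step : n - 1 - step;
        unsigned int before = acc;      /* running value ahead of this node */
        int begins = 0;                 /* does this node open a segment? */

        if (step > 0 && smode == SIM_START_BIT)
            begins = (sbit[node] != 0);
        else if (step > 0 && smode == SIM_SEGMENT_BIT)
            /* Upward: the bit opens a segment here. Downward: the bit
               closed a segment at the node visited just before. */
            begins = (dir == SIM_UPWARD) ? (sbit[node] != 0)
                                         : (sbit[node + 1] != 0);
        if (begins) {
            acc = 0u;                   /* restart the running value */
            if (smode == SIM_SEGMENT_BIT)
                before = 0u;            /* nothing crosses a segment bit */
            /* With a start bit, `before` still holds the reduce of the
               previous segment, which an exclusive scan passes on. */
        }
        if (inc == SIM_EXCLUSIVE)
            out[node] = before;
        if (in[node] > acc)
            acc = in[node];
        if (inc == SIM_INCLUSIVE)
            out[node] = acc;
    }
}
```

Run on the values 4 1 5 2 with bits 0 0 1 0, an upward exclusive scan returns 0 4 0 5 in segment-bit mode and 0 4 4 5 in start-bit mode, the two results given in the text.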


9.5 Concatenation Operations

CMMD_concat_with_nodes (void *element, void *buffer, int elem_length)

*element      A pointer to the element this node contributes to the
              concatenation process.

*buffer       A pointer to the buffer in which the returned value will
              be stored. Its length in bytes must be at least
              (CMMD_partition_size () * elem_length).

elem_length   The length in bytes of the element to be concatenated.
              Must be identical across all nodes.

CMMD_concat_with_nodes concatenates elements of equal length from each 
node into the given buffer. The length of the buffer in bytes must be at least 
(CMMD_partition_size () * elem_length). This function does not return until
all nodes have called CMMD_concat_with_nodes. The host processor is not 
involved. 
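The buffer-size requirement can be illustrated with an off-machine model. The sketch below is not CMMD code: sim_concat and its parameters are hypothetical, partition_size stands in for the value returned by CMMD_partition_size (), and the placement of node i's element at byte offset i * elem_length is an assumption, since the text does not state the order in which elements are stored.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Model of the concatenation result. elements[i] points to the element
   contributed by simulated node i. The buffer must hold at least
   partition_size * elem_length bytes, as the text requires. */
void sim_concat(const void *elements[], size_t partition_size,
                size_t elem_length, void *buffer, size_t buffer_length)
{
    size_t node;

    assert(elements != NULL && buffer != NULL);
    assert(buffer_length >= partition_size * elem_length);
    for (node = 0; node < partition_size; node++) {
        /* Assumed layout: elements are stored in node order. */
        memcpy((char *)buffer + node * elem_length,
               elements[node], elem_length);
    }
}
```

For three simulated nodes each contributing one int, the caller would supply a buffer of at least 3 * sizeof(int) bytes.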


CMMD_gather_from_nodes (void *buffer, int elem_length)
CMMD_concat_elements_to_host (void *element, int elem_length)

*buffer       (Host only.) A pointer to the buffer in which the returned
              value will be stored. Its length in bytes must be at least
              (CMMD_partition_size () * elem_length).

*element      (Nodes only.) A pointer to the element this node
              contributes to the concatenation process.

elem_length   The length in bytes of the element to be concatenated.
              (Must be identical for all processors.)

This pair of functions concatenates elements from each node into a buffer on the 
host. The element length must be identical for all processors, and the host must 
specify enough space to store the result. The function returns after all nodes have 
called CMMD_concat_elements_to_host and the host has called CMMD_ 
gather_from_nodes. 

Note that these functions are essentially the opposite of the functions
CMMD_distrib_to_nodes and CMMD_receive_element_from_host. That pair
distributes the contents of a buffer element-wise from the host to the nodes; this
pair gathers the elements from the nodes into a buffer on the host.



Routines That Let You
Create Your Own Protocol


PLEASE NOTE 

(1) The routines documented in this appendix, CMMD_send_packet
and CMMD_receive_packet, cannot be used in conjunction
with other CMMD send and receive routines. The
library provides no protection against doing so, but results are
likely to be indeterminate. CMMD global functions, on the other
hand, can be used with these packet routines.

(2) Creating a message-passing protocol is not a simple operation.
Deadlocks are not only possible, they are extremely likely.
Please do not use these routines unless you have very good reasons
for doing so, and are experienced at message-passing
multiprocessor programming.


Using CMMD_send_packet and CMMD_receive_packet, nodes can send and
receive non-blocking messages of up to 20 bytes in length. The routines provide
no synchronization, nor any means of verifying that a message, once sent,
is received somewhere. Users employing these routines must ensure that any
messages sent by them are received; unreceived messages can clog the data
network and cause the program to hang.

CMMD_send_packet and CMMD_receive_packet make no provision for
headers. Users must create their own headers and their own software to parse
whatever header-and-text combination they decide to use.
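One possible header scheme is sketched below. Everything in it is an assumption made for illustration: the layout (one header word carrying a 16-bit tag and a 16-bit payload length, followed by up to four payload words), the pkt_ names, and the use of uint32_t for the 32-bit words that the packet routines pass as unsigned int. Only the 20-byte limit comes from the text.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PKT_MAX_WORDS   5                    /* 20 bytes, the stated limit */
#define PKT_MAX_PAYLOAD (PKT_MAX_WORDS - 1)  /* one word used as header */

/* Hypothetical header word: high 16 bits hold a user tag, low 16 bits
   hold the payload length in words. */

/* Build a packet. Returns the total length in words, which is the
   value a caller would pass as words_to_send. */
int pkt_pack(uint32_t *packet, unsigned int tag,
             const uint32_t *payload, int payload_words)
{
    assert(packet != NULL && tag <= 0xFFFFu);
    assert(payload_words >= 0 && payload_words <= PKT_MAX_PAYLOAD);
    assert(payload_words == 0 || payload != NULL);

    packet[0] = ((uint32_t)tag << 16) | (uint32_t)payload_words;
    if (payload_words > 0)
        memcpy(&packet[1], payload,
               (size_t)payload_words * sizeof(uint32_t));
    return payload_words + 1;
}

/* Parse a received packet. Returns the payload length in words, or -1
   if the header does not agree with the number of words received. */
int pkt_unpack(const uint32_t *packet, int words_received,
               unsigned int *tag, uint32_t *payload)
{
    int payload_words;

    if (packet == NULL || tag == NULL || payload == NULL)
        return -1;
    if (words_received < 1 || words_received > PKT_MAX_WORDS)
        return -1;
    payload_words = (int)(packet[0] & 0xFFFFu);
    if (payload_words != words_received - 1)
        return -1;
    *tag = (unsigned int)(packet[0] >> 16);
    if (payload_words > 0)
        memcpy(payload, &packet[1],
               (size_t)payload_words * sizeof(uint32_t));
    return payload_words;
}
```

Carrying the length in the header lets the receiver cross-check it against the word count returned by the receive routine, which gives a cheap sanity test in a protocol that otherwise has no error detection.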


A.1 The Packet Routines 

CMMD_send_packet (unsigned int destination, int words_to_send,
                  unsigned int *buffer, unsigned int type)

destination     An integer identifying the node to which the message is
                to be sent.

words_to_send   The length of the buffer, in 32-bit words.

*buffer         A pointer to a buffer that contains the message to be sent.

type            At this release, 0 is the only allowable value for this
                argument.

This function sends out a message to the destination node. Arguments specify the
length (expressed in 32-bit words) of the packet and the starting address of the
message.

The function is non-blocking. It does not wait for any acknowledgment from the
receiver. It returns TRUE if the message has been sent into the communications
network, FALSE otherwise.



CMMD_receive_packet (unsigned int *buffer)

*buffer         A pointer to the buffer into which the received message
                is written.

The function checks for incoming messages. If it finds one, it receives the
message, writes it into the buffer, and returns the number of words received.
If it finds no incoming message, it returns -1.
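Because a send can fail and unreceived packets can clog the network, one plausible pattern is to drain incoming packets between send attempts. The sketch below is an assumption about usage, not a documented protocol. To stay runnable off the machine it takes the send and receive routines as function pointers whose shapes mirror the signatures above (with int return values assumed), and it includes a hypothetical single-slot loopback (the mock_ names) in place of the real network.

```c
#include <assert.h>

#define PKT_MAX_WORDS 5   /* 20 bytes, the stated packet limit */

/* Shapes assumed for CMMD_send_packet and CMMD_receive_packet. */
typedef int (*send_fn)(unsigned int destination, int words_to_send,
                       unsigned int *buffer, unsigned int type);
typedef int (*recv_fn)(unsigned int *buffer);
typedef void (*handler_fn)(const unsigned int *packet, int words);

/* Receive every packet currently waiting; returns how many were handled. */
int drain_incoming(recv_fn receive, handler_fn handle)
{
    unsigned int packet[PKT_MAX_WORDS];
    int words, handled = 0;

    while ((words = receive(packet)) != -1) {
        handle(packet, words);
        handled++;
    }
    return handled;
}

/* Try to send, draining incoming packets after each failed attempt.
   Returns 1 on success, 0 if max_attempts sends all failed. */
int send_with_drain(send_fn send, recv_fn receive, handler_fn handle,
                    unsigned int destination, int words_to_send,
                    unsigned int *buffer, int max_attempts)
{
    int attempt;

    assert(words_to_send >= 0 && words_to_send <= PKT_MAX_WORDS);
    for (attempt = 0; attempt < max_attempts; attempt++) {
        if (send(destination, words_to_send, buffer, 0u)) /* type is 0 */
            return 1;
        drain_incoming(receive, handle);
    }
    return 0;
}

/* Hypothetical single-slot loopback used only to exercise the sketch. */
unsigned int mock_slot[PKT_MAX_WORDS];
int mock_slot_words = -1;               /* -1 means the slot is empty */
int mock_words_handled = 0;

int mock_send(unsigned int destination, int words_to_send,
              unsigned int *buffer, unsigned int type)
{
    int i;
    (void)destination;
    (void)type;
    if (mock_slot_words != -1)
        return 0;                       /* "network" full: send fails */
    for (i = 0; i < words_to_send; i++)
        mock_slot[i] = buffer[i];
    mock_slot_words = words_to_send;
    return 1;
}

int mock_receive(unsigned int *buffer)
{
    int i, words = mock_slot_words;
    if (words == -1)
        return -1;                      /* nothing waiting */
    for (i = 0; i < words; i++)
        buffer[i] = mock_slot[i];
    mock_slot_words = -1;
    return words;
}

void mock_handle(const unsigned int *packet, int words)
{
    (void)packet;
    mock_words_handled += words;
}
```

In a real program the function pointers would be bound to the library routines themselves, and the handler would parse whatever header the user protocol defines. The bound on attempts is a design choice that lets the caller detect a stalled network instead of spinning indefinitely.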